Fine-tuning an LLM on your texts: part 4 – QLoRA

At last, it’s time to fine-tune your own LLM! We curated your dataset last week, and you’re now perfectly set up to fine-tune a Llama 2 model on your text messages, as I did over the holidays. We’re using QLoRA, and this week we’ll be writing more code than in any of the previous installments. Let’s get to it!

QLoRA is a highly efficient approach to fine-tuning an LLM that allows you to train on a single GPU, and it’s become ubiquitous for these sorts of projects. It combines 2 techniques:

  • Quantization, i.e. using an LLM with its weight precision reduced to 4 bits
  • LoRA (low-rank adaptation of LLMs), which freezes the model weights and injects new trainable layers, vastly reducing the number of weights to be trained.

More on QLoRA in the reading list at the end of the first post. At this point, go back to Google Colab (or equivalent) and start with some installs and imports. I recommend you use a V100 GPU instance.

# installs

!pip install -q torch peft bitsandbytes transformers trl accelerate sentencepiece cryptography wandb

# imports

import torch
from datasets import load_dataset, Dataset, DatasetDict
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)
from peft import LoraConfig
from trl import SFTTrainer
from cryptography.fernet import Fernet
from getpass import getpass
from huggingface_hub import notebook_login
import os
import wandb

Hyper-parameters

Searching for hyper-parameters is a strange and bewildering activity, filled with traps, pitfalls and rabbit holes, and only occasional eureka moments. The careful approach is to start by changing 1 parameter at a time. Sadly, I lack the patience and self-control to do this: I tweak a whole bunch at the same time on a whim, then have to go back and redo runs when faced with contradictory results.

The good news: I’ve gone through this painful exercise for you, and emerged with hyper-parameters that worked well, at least for my dataset. I’d suggest you take these as your starting point. Fingers crossed they work well for you too out of the gate.

# SETUP

DATA_NAME = 'your-hf-username/messagesv1'
PROJECT_NAME = 'messages'
RUN_NAME = 'v1'
MAX_SEQ_LENGTH = 200
BASE_MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf" # Try 13b later
REFINED_MODEL_NAME = f"your-hf-username/{PROJECT_NAME}-{RUN_NAME}"

# HYPER-PARAMETERS

LORA_ALPHA = 64
LORA_R = 32
LORA_DROPOUT = 0.1
BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 4
LEARNING_RATE = 2e-4
LR_SCHEDULER_TYPE = 'cosine'
WEIGHT_DECAY = 0.001
TARGET_MODULES = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# OTHER TRAINING CONFIG

STEPS = 100
SAVE_STEPS = 500
EVAL_STEPS = 1000

Hyper-parameter explanations

Here are brief descriptions of the hyper-parameters and the values that I ended up with:

  • LORA_ALPHA: this is a factor for how much the injected LoRA weight matrices are scaled; the rule of thumb is that it should be double the next hyper-parameter, r.
  • LORA_R: this is the number of dimensions in the LoRA weight matrices. Many examples I’ve seen use LORA_ALPHA=32 and LORA_R=16, and I started with that, but I found these higher values worked better, perhaps because of the quantity of my training data. You should try both. Note that higher values of r increase training time (there’s a rough parameter-count sketch after this list).
  • LORA_DROPOUT: this is the dropout probability used in the LoRA layers; 0.1 seems the most common, but 0.05 is also used. I tried both, and 0.1 performed better.
  • BATCH_SIZE: I didn’t have much choice here; I ran out of GPU memory on the 16GB V100 if I tried a batch size higher than 1.
  • GRADIENT_ACCUMULATION_STEPS: accumulating gradients for 4 batch steps seemed to work well.
  • LEARNING_RATE, LR_SCHEDULER_TYPE, WEIGHT_DECAY: you can visualize the effect of these on the learning rate during training in Weights & Biases. I found that these settings – which decreased the learning rate by a factor of 10 during training – performed better than a constant LR of 2e-4 or 2e-5.
  • TARGET_MODULES: the specific modules in the model architecture to target with LoRA; this is the most common choice – all the linear layers. In some situations lm_head is also targeted, when you’re looking to fine-tune on the language in the input, which should not be the case for us.
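
To make the relationship between r and alpha concrete, here’s a rough back-of-the-envelope estimate of how many trainable weights these LoRA settings add – a sketch only, using the LORA_R and LORA_ALPHA values from the setup above and Llama-2-7B’s published dimensions:

# Rough count of trainable LoRA weights for Llama-2-7B with the settings above.
# A LoRA adapter on a d_in x d_out linear layer adds r * (d_in + d_out) weights.

HIDDEN, MLP, LAYERS = 4096, 11008, 32

attn = 4 * LORA_R * (HIDDEN + HIDDEN)                        # q_proj, k_proj, v_proj, o_proj
mlp = 2 * LORA_R * (HIDDEN + MLP) + LORA_R * (MLP + HIDDEN)  # gate_proj, up_proj, down_proj
total = LAYERS * (attn + mlp)

print(f"Trainable LoRA weights: ~{total / 1e6:.0f}M (roughly 1% of the 7B base)")     # ~80M
print(f"LoRA scaling factor alpha / r: {LORA_ALPHA / LORA_R}")                        # 2.0
print(f"Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")            # 4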

Next, sign in to Hugging Face and to Weights & Biases using your keys:

notebook_login()
wandb_key = getpass("Enter Weights & Biases Key")
wandb.login(key=wandb_key, relogin=True)
# set the wandb project where this run will be logged
os.environ["WANDB_PROJECT"] = PROJECT_NAME

# save your trained model checkpoint to wandb
os.environ["WANDB_LOG_MODEL"] = "true"

# turn off watch to log faster
os.environ["WANDB_WATCH"] = "false"

Load and decrypt your messages

Now you can load your messages from Hugging Face and decrypt them using the same key as before. This code prints the dataset structure and an example from each split; check that everything looks as expected.

# First load the dataset from Hugging Face

encrypted_data = load_dataset(DATA_NAME)

# Next, decrypt

key = getpass("Enter encryption key").encode()
f = Fernet(key)
decrypted_data = {'train':[], 'test':[]}
# iterate over both splits, decrypting each datapoint into the matching list
for split_name, split_list in decrypted_data.items():
  split = encrypted_data[split_name]
  for datapoint in split:
    old = datapoint['text']
    split_list.append(f.decrypt(old).decode('utf-8'))

# Finally, recreate the dataset

train_dataset = Dataset.from_dict({'text':decrypted_data['train']})
test_dataset = Dataset.from_dict({'text':decrypted_data['test']})
data = DatasetDict({'train':train_dataset, 'test':test_dataset})

print(data)
print(data['train'][0])
print(data['test'][0])

Showtime

We have reached the moment of truth. First we load the quantized base Llama 2 model for training:

# Model and tokenizer names
base_model_name = BASE_MODEL_NAME
refined_model = REFINED_MODEL_NAME

# Tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token  # Llama 2 has no pad token, so reuse EOS
llama_tokenizer.padding_side = "right"

# Quantization Config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=quant_config,
    device_map="auto"
)

base_model.config.use_cache = False   # the KV cache isn't needed during training
base_model.config.pretraining_tp = 1  # use the standard (non-tensor-parallel) linear layer computation

Finally, we run LoRA fine-tuning on the quantized model. If this runs out of memory, try reducing the LoRA r hyper-parameter (and alpha by the same factor) and reducing the number of target modules; one such variant is sketched in the comments after the LoRA config below.

# LoRA Config
peft_parameters = LoraConfig(
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    r=LORA_R,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=TARGET_MODULES,
)
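
# If you hit out-of-memory errors, a lower-footprint variant (a sketch, not the
# config from my runs) halves r and alpha together – keeping the alpha/r scaling
# at 2 – and targets only the attention projections:
#
#   peft_parameters = LoraConfig(
#       lora_alpha=32,
#       lora_dropout=LORA_DROPOUT,
#       r=16,
#       bias="none",
#       task_type="CAUSAL_LM",
#       target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
#   )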

# Training Params
train_params = TrainingArguments(
    output_dir=REFINED_MODEL_NAME,
    num_train_epochs=1,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=1,
    evaluation_strategy="steps",
    eval_steps=EVAL_STEPS,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    optim="paged_adamw_32bit",
    save_steps=SAVE_STEPS,
    save_total_limit=10, # to avoid running out of disk space!
    logging_steps=STEPS,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type=LR_SCHEDULER_TYPE,
    report_to="wandb",
    run_name=RUN_NAME,
    push_to_hub=True,
    hub_model_id=REFINED_MODEL_NAME,
    hub_strategy="end",
    hub_private_repo=True
)

# Trainer
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=data['train'],
    eval_dataset=data['test'],
    peft_config=peft_parameters,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    max_seq_length=MAX_SEQ_LENGTH,
    args=train_params
)

# Training
fine_tuning.train()

# Save Model
fine_tuning.model.save_pretrained(refined_model)

# Push model
fine_tuning.model.push_to_hub(REFINED_MODEL_NAME, private=True)

Each of these training runs took 8 hours for me, so I would typically leave it going overnight, and I’d wake up delighted to find my new model uploaded to Hugging Face.

Following along in Weights & Biases

While this is running, sign in to W&B and follow along. You should see the training loss come down quickly after the first few training steps; in some of my runs it kept improving until near the end of training.

I only trained for 1 epoch; I tried a 2nd epoch once, and it over-fitted immediately and significantly, with the eval loss leaping up.

A note of caution: it can be very addictive to keep tweaking hyper-parameters and watching the loss come down. It can also be quite misleading: I found situations where the loss decreased but the quality of the results was definitely worse. Bottom line: ultimately you need to look at text generation results to determine how well your model is performing.
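
Once a run finishes, a quick way to eyeball some generations is something along these lines – a minimal sketch using the pipeline class we imported earlier, with a placeholder prompt that you’ll want to swap for one in the same format as your training data:

# Quick smoke test (a sketch): generate a short completion from the model we
# just trained, still in memory. Use a prompt in the same format as your data.
fine_tuning.model.config.use_cache = True  # re-enable the KV cache for faster generation
generator = pipeline(
    task="text-generation",
    model=fine_tuning.model,
    tokenizer=llama_tokenizer
)
print(generator("<a prompt in your dataset's format>", max_new_tokens=80)[0]["generated_text"])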

And that’s the perfect segue to the next and final part, in which you will get to try out your model. We will generate text messages, acting as you or acting as your friends. Talking with yourself is a particularly mind-bending experience! I’m pretty sure you’ll find this as fun as I did, and well worth the time investment and the $100. The final installment is here.
